This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.
library(tidyverse)
library(GGally)
library(caret)
library(moments)
library(corrplot)
library(ggmap)
library(arules)
library(xgboost)
Import the data and examine the data type of each feature
dataset_path <- "/Users/truongminh/Downloads/SchoolWork/UW_Data_Analytics/Data_Mining/Final_Project"
houses <- read.delim(file.path(dataset_path,"kc_house_data.csv"),sep = ",", header = TRUE)
houses %>% glimpse()
## Rows: 21,613
## Columns: 21
## $ id <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 1954400…
## $ date <fct> 20141013T000000, 20141209T000000, 20150225T000000, 2014…
## $ price <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500…
## $ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2…
## $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2…
## $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 18…
## $ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470…
## $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, …
## $ waterfront <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ view <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0…
## $ condition <int> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4…
## $ grade <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, …
## $ sqft_above <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 18…
## $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0,…
## $ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2…
## $ yr_renovated <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ zipcode <int> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198,…
## $ lat <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 4…
## $ long <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.…
## $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 2…
## $ sqft_lot15 <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113,…
We will first examine the distribution of each numeric column using histograms. From these histograms we can see that very few numeric features are normally distributed, which will decrease the reliability of the model given that the model uses a linear algorithm.
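The histogram chunk itself is hidden (echo = FALSE); the following is a minimal sketch of how such a panel can be produced with the tidyverse, using a small synthetic stand-in for the houses data:

```r
library(tidyverse)

set.seed(1808)
# Synthetic stand-in for two right-skewed house features
toy <- tibble(price    = rlnorm(500, meanlog = 13, sdlog = 0.5),
              sqft_lot = rlnorm(500, meanlog = 9,  sdlog = 1))

# Pivot to long format so facet_wrap() draws one histogram per feature
p <- toy %>%
  pivot_longer(everything(), names_to = "feature", values_to = "value") %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ feature, scales = "free")
p
```

On the real data, replacing `toy` with `houses %>% select(where(is.numeric))` produces the full panel.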
corr_matrix <- houses %>% select(-c(lat,long,renovated,waterfront)) %>% cor()
corr_matrix %>% corrplot(method = "pie")
From this correlation map of the numeric features, we can see that features relating to size seem to influence the price of the house the most, namely: “bedrooms”, “bathrooms”, “sqft_living”, “sqft_lot”, “sqft_above”, “sqft_living15”, and “grade”. However, we need to keep in mind that these features are also correlated with one another. Therefore, some of them need to be filtered out to avoid multicollinearity.
#Feature extraction by filtering
houses <- houses %>% select(-c(sqft_living15, sqft_lot15, sqft_above,age))
Let’s further explore the three discrete variables floors, bedrooms, and bathrooms. Among the three, the number of bathrooms appears to have the most significant impact on the price of the house.
Seeing how price differs within each variable grouping using violin plots
Each of these features has a visible effect on price. View seems to be the feature whose effect on price is the least distinctive.
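The violin-plot chunk is also hidden; here is a minimal sketch of the idea, using synthetic data in place of the real columns:

```r
library(tidyverse)

set.seed(1808)
# Synthetic stand-in: price grouped by a discrete feature such as floors
toy <- tibble(floors = factor(sample(1:3, 300, replace = TRUE)),
              price  = rlnorm(300, meanlog = 13, sdlog = 0.4))

# One violin per group makes shifts in the price distribution easy to compare
p <- ggplot(toy, aes(floors, price)) +
  geom_violin() +
  scale_y_log10()
p
```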
Intuitively, it makes sense that the location of the house largely influences the price as well. Let’s visualize the distribution of house prices on a map, using longitude and latitude as coordinates.
We can see a pattern in how the higher-priced houses are distributed: there is a high concentration of high-priced houses in the Ballard area, followed by West Seattle, Downtown Seattle, Mercer Island, and Bellevue. Another interesting observation is that many of the darker areas are around waterfront housing.
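The map chunk is hidden as well; as a sketch, a plain scatter of longitude against latitude colored by log price reveals the same clusters (ggmap::qmplot can add a basemap underneath, given a map tile source). Synthetic coordinates stand in for the real ones here:

```r
library(tidyverse)

set.seed(1808)
# Synthetic coordinates roughly spanning King County
toy <- tibble(long  = runif(300, -122.5, -121.3),
              lat   = runif(300, 47.15, 47.80),
              price = rlnorm(300, meanlog = 13, sdlog = 0.5))

# Color each sale by log10 price so the expensive clusters stand out
p <- ggplot(toy, aes(long, lat, colour = log10(price))) +
  geom_point(alpha = 0.5) +
  scale_colour_viridis_c()
p
```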
It’s clear that longitude and latitude influence house price. However, the current format of these variables makes it difficult to extract meaningful information. Therefore, I’m going to discretize them and group them into categories.
houses$vert_loc <- discretize(houses$lat,
method = "frequency",
breaks = 3,
labels = c("South","Mid","North"))
houses$horz_loc <- discretize(houses$long,
method = "frequency",
breaks = 3,
labels = c("West","Mid","East"))
houses <- houses %>% select(-c(lat,long))
From examining the histograms in the previous section, we concluded that most of the numeric features are highly skewed and do not follow a normal distribution. Let’s check their skewness.
num_feat = houses %>% select(-c(waterfront,renovated, vert_loc, horz_loc)) %>% names()
apply(houses[num_feat],2,skewness) # table showing the skewness of each variable
## price bedrooms bathrooms sqft_living sqft_lot
## 4.0237899 1.9741625 0.5110721 1.4714533 13.0591125
## floors view condition grade sqft_basement
## 0.6161340 3.3955139 1.0327330 0.7710497 1.5778555
The two most highly skewed variables in this dataset are price and sqft_lot. To improve prediction accuracy, let’s convert them to a log scale so that they better fit a Gaussian distribution.
# Convert skewed data to log scale
houses <- houses %>% mutate(sqft_lot = log10(sqft_lot),
price = log10(price))
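As a quick sanity check on why the log transform helps, here is a self-contained sketch on a synthetic log-normal variable, with a hand-rolled skewness function equivalent to moments::skewness():

```r
set.seed(1808)

# Population skewness, matching moments::skewness()
skew <- function(v) mean((v - mean(v))^3) / mean((v - mean(v))^2)^1.5

# Log-normal toy variable mimicking the right skew of price and sqft_lot
x <- rlnorm(10000, meanlog = 13, sdlog = 0.5)

skew(x)         # strongly right-skewed
skew(log10(x))  # roughly symmetric after the log transform
```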
# One-hot encode the categorical features ("- 1" drops the intercept so
# each level gets its own indicator column)
horz_loc <- model.matrix(~horz_loc - 1, data = houses)
vert_loc <- model.matrix(~vert_loc - 1, data = houses)
waterfront <- model.matrix(~waterfront - 1, data = houses)
houses <- cbind(houses %>% select(-c(horz_loc, vert_loc, waterfront)), horz_loc, vert_loc, waterfront)
# 70/30 train-test split, stratified on price
train <- createDataPartition(y = houses$price, p = 0.7, list = FALSE)
houses_train <- houses[train,]
houses_test <- houses[-train,]
# Drop near-zero-variance features, then center and scale, using
# statistics estimated on the training set only
prepro <- preProcess(houses_train[num_feat], method = c("nzv","center","scale"))
houses_train <- predict(prepro, houses_train)
houses_test <- predict(prepro, houses_test)
For this regression problem, I’m going to try two methods: ridge regression and an XGBoost regression tree. Ridge is a regularized regression method known for its simplicity; it penalizes large coefficients, trading a little bias for lower variance. This means somewhat lower accuracy on the training data, but it makes the model much less likely to overfit and gives it lower variance on unseen data.
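To illustrate the shrinkage behaviour, here is a small self-contained sketch using MASS::lm.ridge (shipped with base R) on synthetic data; the fitted coefficients move toward zero as the penalty lambda grows:

```r
library(MASS)

set.seed(1808)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 * x1 - 1 * x2 + rnorm(n)

# Fit ridge regression over a grid of penalties
fit <- lm.ridge(y ~ x1 + x2, data = data.frame(y, x1, x2),
                lambda = c(0, 10, 100))

# One row of coefficients per lambda; larger penalties shrink them toward 0
coef(fit)
```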
set.seed(1808)
ridge_model <- train(price ~ .,
data = houses_train,
method = "ridge",
tuneGrid = expand.grid(lambda = seq(0, 1, 0.05)),
trControl = trainControl(method = "cv", 10))
xgboost_model <- train(price ~ .,
data = houses_train,
method = "xgbTree",
trControl = trainControl("cv", number = 10))
Compare the MAE, RMSE, and R-squared scores of each model after cross-validation
compare <- resamples(list(xg_boost = xgboost_model,
ridge_regression = ridge_model))
bwplot(compare)
Cross-validation scores on the training data show the XGBoost model yielding slightly better accuracy. XGBoost is known for its high accuracy, but its big drawback is its tendency to overfit; that’s the main reason I decided to use ridge regression as my backup model, as it compensates for this shortcoming of XGBoost in return for lower accuracy.
Let’s see how each model performs on the training and test datasets.
compare_on_train <- resamples(list(ridge = ridge_model,
xgboost = xgboost_model))
compare_on_test <- data.frame(row.names = c('RMSE','RSquared','MAE'))
compare_on_test['ridge_model'] <- postResample(predict(ridge_model,houses_test),houses_test$price)
compare_on_test['xgboost_model'] <- postResample(predict(xgboost_model,houses_test),houses_test$price)
bwplot(compare_on_train)
compare_on_test
## ridge_model xgboost_model
## RMSE 0.4773049 0.4362115
## RSquared 0.7682076 0.8062938
## MAE 0.3671914 0.3262627
XGBoost still outperforms the ridge model on unseen data, which once again demonstrates its effectiveness. After examining their performance on the held-out test data, it’s safe to go with the XGBoost model.
In conclusion, with an R-squared of about 0.8, the XGBoost model seems to be a good choice for predicting house prices. However, it is worth mentioning that the set of variables used for these models is much larger than the variables given in the problem.
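One practical caveat: because price was modeled on the log10 scale (and standardized by preProcess), predictions must be mapped back before being reported in dollars. A minimal illustration of the log10 step alone, with hypothetical model output:

```r
# Hypothetical log10-scale predictions from the fitted model
log_pred <- c(5.35, 5.73, 5.26)

# Back-transform to dollars
dollar_pred <- 10^log_pred
round(dollar_pred)
```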
Note: several code chunks use the echo = FALSE parameter, so the R code that generated certain plots is not shown.